Introduction

  • Notes:
    1. AY2017/2018 Semester 2 and AY2018/2019 Semester 2 bidding data not available.
    2. The bidding statistics are highly non-normal because they are bounded below by zero (neither bids nor the number of bidders can be negative). Consider zero-inflated or Poisson regression if these statistics are used as dependent variables.
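
As a rough illustration of the second note, a count outcome such as the number of bidders could be modelled with a Poisson GLM from base R. The data below are simulated, and the column names `Bidders` and `Period` are borrowed from the cleaned data purely for illustration. This is a sketch under those assumptions, not a model fitted in this report. A zero-inflated variant would need an additional package and is not shown.

```r
# Simulated toy data (NOT the real bidding data): a count outcome and a two-level predictor
set.seed(1)
toy <- data.frame(
  Period = factor(rep(c("Morning", ">=Afternoon"), each = 50),
                  levels = c("Morning", ">=Afternoon"))
)
toy$Bidders <- rpois(nrow(toy), lambda = ifelse(toy$Period == "Morning", 4, 8))

# Poisson regression with a log link, which respects the lower bound at zero
fit <- glm(Bidders ~ Period, data = toy, family = poisson(link = "log"))

# Exponentiated coefficients are rate ratios relative to the reference level ("Morning")
exp(coef(fit))
```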

Phase 1: Setting Up Environment, Packages And Loading Data

Load Module Information

  • Module information was scattered across different folders.
  • A loop repeats the download-and-convert-to-dataframe step for each folder, building the URL for each academic year and semester.
    • The same concept was used to consolidate information about the Module Titles.
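library(rjson)   # fromJSON(); assumed to be the JSON package in use, since the calls below pass a `file` argument
library(stringr) # str_detect()
library(dplyr)   # %>%, select(), filter(), distinct(), right_join()
myModInfo <- data.frame() # create empty dataframe which will act as a container to be populated with data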
for(year in c(2011:2018)) # looping through each year
{
  for(semester in c(1,2))
  {
    # create the url where data is to be extracted from
    myurl <- paste0("https://api.nusmods.com/", year, "-", year + 1, "/", semester, "/moduleTimetableDeltaRaw.json")
    myjson <- fromJSON(file = url(myurl))
    for(r in seq_along(myjson)) # for each element in the myjson list, append it to myModInfo (seq_along is safe if the list is empty)
    {
      if(isTRUE(str_detect(myjson[[r]]$ModuleCode, "^PL"))) # only keep info if module code begins with PL
      {
        if(myjson[[r]]$Semester == 1 || myjson[[r]]$Semester == 2) # only get semester 1 and 2 information
        {
          myModInfo <- rbind(myModInfo, myjson[[r]]) # add to dataframe
        }
      }
      myjson[[r]] <- NA # replace the element with NA to free up some RAM
    }
    cat(year, "Semester", semester, "Done!\n") # progress tracker, one line per semester
  }
}

myTitles <- data.frame() # create empty dataframe which will act as a container to be populated with data
for(year in c(2014:2018)) # looping through each year
{
    myurl <- paste0("https://api.nusmods.com/", year, "-", year + 1, "/moduleList.json") # create the url where data is to be extracted from
    myjson <- fromJSON(file = url(myurl))
    for(r in seq_along(myjson)) # for each element in the myjson list, append it to myTitles
    {
      if(isTRUE(str_detect(myjson[[r]]$ModuleCode, "^PL"))) # only keep info if module code begins with PL
      {
        sem <- paste0(myjson[[r]]$Semester, collapse = "|") # semesters offered, e.g. "1", "2" or "1|2"
        if(sem %in% c("1", "2", "1|2")) # only keep information from semester 1 and 2
        {
          myTitles <- rbind(myTitles, as.data.frame(myjson[[r]])) # add to dataframe
        }
      }
      myjson[[r]] <- NA # free RAM
    }
}

myModInfo <- myTitles %>% # add titles information to myModInfo
  select(ModuleCode, ModuleTitle) %>% # select these two columns
  filter(ModuleTitle != "Lab in Applied Psychology") %>%
  distinct() %>% # remove duplicates
  right_join(myModInfo, by = "ModuleCode") # left = myTitles, right = myModInfo

saveRDS(myModInfo, file = "myModInfo.RDS") # save to directory

Load myModInfo.RDS

  • Downloading the data from the API using the code above takes a substantial amount of time.
  • I saved the downloaded data in myModInfo.RDS and loaded it directly from that file while working on the project.
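
A minimal sketch of the load step, assuming the cache file was written by `saveRDS()` as in the download code. The helper name `load_cached` is hypothetical, and the usage lines write a throwaway file so that the sketch runs without the real cache.

```r
# Hypothetical helper: read the cached data frame if it exists,
# otherwise stop with a hint that the download code must be run first
load_cached <- function(path = "myModInfo.RDS") {
  if (!file.exists(path)) {
    stop("Cache file not found: ", path, ". Run the download code first.")
  }
  readRDS(path)
}

# Usage with a temporary stand-in file (not the real myModInfo.RDS)
tmp <- tempfile(fileext = ".RDS")
saveRDS(data.frame(ModuleCode = "PL2131"), file = tmp)
cached <- load_cached(tmp)
```

In the project itself the call would simply be `myModInfo <- load_cached()`, assuming the file sits in the working directory.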

Phase 2: Filter, Transform And Merge

Module Information

  • Filter information from the dataframe myModInfo.
    • Removing non-Psychology modules.
    • Removing modules without module titles. These are modules that appeared before AY2014/2015 and never resurfaced afterwards.
    • Removing information about tutorials.
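
The three filters above might look roughly like the following. This is a sketch on toy rows, not the report's actual filter code: the column `LessonType` and its values are assumptions about the timetable data, the titles are placeholders, the code `PL9999` is made up, and the rule "PL followed by a digit" is one possible way to identify Psychology modules.

```r
library(dplyr)

# Toy rows standing in for myModInfo (placeholder titles; LessonType is an assumed column)
toy_mod <- data.frame(
  ModuleCode  = c("PL2131", "PL2131", "PLB1201", "PL9999"),
  ModuleTitle = c("Title A", "Title A", "Title B", NA),
  LessonType  = c("LECTURE", "TUTORIAL", "LECTURE", "LECTURE"),
  stringsAsFactors = FALSE
)

filtered_mod <- toy_mod %>%
  filter(grepl("^PL[0-9]", ModuleCode)) %>%                   # keep codes with "PL" followed by a digit
  filter(!is.na(ModuleTitle)) %>%                             # drop modules without a title
  filter(!grepl("TUTORIAL", LessonType, ignore.case = TRUE))  # drop tutorial rows

filtered_mod
```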

Filter

Bidding Information myBid

  • Filter information from the dataframe myBid.
    • Removing non-Psychology modules, including Roots and Wings (prefixed with PLS-) and Psychology for non-Psychology students (prefixed with PLB-).
    • Removing information from quotas that are reserved and not available for bidding.
    • Removing information from modules with more than one lecture/seminar session.
    • Removing bidding information from non-psychology students.
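
Two of these filters can be sketched on toy rows as follows. The module-code rule drops the PLS and PLB prefixes by requiring a digit after "PL". The account-type rule is an assumption: it treats account types tagged `[P]` as the ones used by Psychology students, and the value `NUS[G]` is invented for contrast. The reserved-quota and multiple-session filters depend on columns not shown here and are left out.

```r
library(dplyr)

# Toy rows standing in for myBid (illustrative codes and account types)
toy_bid <- data.frame(
  ModuleCode      = c("PL3232", "PLS8001", "PLB1201", "PL3232"),
  StudentAcctType = c("NUS[P]", "NUS[P]", "New[P]", "NUS[G]"),
  stringsAsFactors = FALSE
)

filtered_bid <- toy_bid %>%
  filter(grepl("^PL[0-9]", ModuleCode)) %>%              # drops PLS... and PLB... codes
  filter(grepl("[P]", StudentAcctType, fixed = TRUE))    # assumption: "[P]" marks the accounts to keep

filtered_bid
```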

Filter

Phase 3: Data Wrangling

  • The variables available in the original data are useful but they are too specific to interpret meaningfully.
  • This section creates new variables from the original data that allow us to better discern trends in the data.
  • Also includes additional wrangling and manipulations to ease the plotting of graphs and analysis later.
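
As a sketch of what such derived variables could look like, the block below computes `BidPerQuota`, `Period` and `Level`, three of the columns that appear in the structure printout in Phase 4. The input values mirror the first rows of that printout. The derivation rules are assumptions inferred from it: bidders divided by quota, a 12pm cut-off for "Morning", and the level taken from the first digit of the module code.

```r
library(dplyr)

# First rows of Quota, Bidders and StartTime as they appear in the Phase 4 printout
toy <- data.frame(
  ModuleCode = c("PL2131", "PL2131", "PL2132"),
  Quota      = c(5, 12, 35),
  Bidders    = c(3, 42, 8),
  StartTime  = c(1600, 1600, 800),
  stringsAsFactors = FALSE
)

wrangled <- toy %>%
  mutate(
    BidPerQuota = Bidders / Quota,                                       # demand relative to supply
    Period = factor(ifelse(StartTime < 1200, "Morning", ">=Afternoon"),  # assumed 12pm cut-off
                    levels = c("Morning", ">=Afternoon")),
    Level = factor(paste("Level", substr(ModuleCode, 3, 3)))             # assumed: first digit of the code
  )

wrangled
```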

Phase 4: Data Diagnostics

  • Plot univariate histograms and bivariate plots using loops for almost every combination of variables.
  • The graphs in this section are meant for diagnostics rather than exploration. They would make little sense if one tried to draw insights from them, because each is aggregated across all other variables.
    • For example: The mean of Bidders is calculated across all academic years, all bidding rounds, all modules…
  • What I am looking out for in this section are odd patterns, like zeroes in places where they shouldn’t be, missing data, highly non-normal data, variables with outliers, etc…
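
A minimal sketch of this kind of diagnostic loop in base R, using toy values rather than the real data: it counts missing values and zeroes for every numeric column and draws one histogram per column. The object names `num_vars` and `diag_tab` are made up for the example.

```r
# Toy data with one missing value and one zero planted on purpose
toy <- data.frame(
  Quota   = c(5, 12, 35, 35, NA),
  Bidders = c(3, 42, 8, 0, 7),
  Round   = factor(c("1A", "1A", "1B", "1B", "1C"))
)

# Names of the numeric columns only
num_vars <- names(toy)[sapply(toy, is.numeric)]

# One row per numeric variable: how many values are missing, how many are exactly zero
diag_tab <- data.frame(
  variable  = num_vars,
  n_missing = sapply(num_vars, function(v) sum(is.na(toy[[v]]))),
  n_zero    = sapply(num_vars, function(v) sum(toy[[v]] == 0, na.rm = TRUE)),
  row.names = NULL
)
print(diag_tab)

# One univariate histogram per numeric variable
for (v in num_vars) {
  hist(toy[[v]], main = paste("Histogram of", v), xlab = v)
}
```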

Univariate Descriptive Statistics

## 'data.frame':    1621 obs. of  16 variables:
##  $ AcadYear           : Factor w/ 8 levels "2011/2012","2012/2013",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Semester           : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Round              : Factor w/ 7 levels "1A","1B","1C",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ModuleCode         : Factor w/ 76 levels "PL2131","PL2132",..: 1 1 2 2 3 3 4 4 5 5 ...
##  $ Quota              : num  5 12 35 35 28 50 25 22 25 30 ...
##  $ Bidders            : num  3 42 8 3 7 2 8 5 3 3 ...
##  $ LowestBid          : num  1 205 1 1 1 1 1 1 1 1 ...
##  $ LowestSuccessfulBid: num  1 977 1 1 1 1 1 1 1 1 ...
##  $ HighestBid         : num  368 1255 500 250 1200 ...
##  $ StudentAcctType    : Factor w/ 4 levels "New[P]","NUS[P]",..: 3 1 3 1 3 1 3 1 3 1 ...
##  $ ModuleTitle        : Factor w/ 74 levels "Abnormal Psychology",..: 64 64 65 65 7 7 12 12 16 16 ...
##  $ DayText            : Factor w/ 5 levels "Monday","Tuesday",..: 3 3 2 2 2 2 3 3 1 1 ...
##  $ StartTime          : num  1600 1600 800 800 1200 1200 1400 1400 1400 1400 ...
##  $ Level              : Factor w/ 3 levels "Level 2","Level 3",..: 1 1 1 1 2 2 2 2 2 2 ...
##  $ BidPerQuota        : num  0.6 3.5 0.2286 0.0857 0.25 ...
##  $ Period             : Factor w/ 2 levels "Morning",">=Afternoon": 2 2 1 1 2 2 2 2 2 2 ...

Bivariate Plots

  • Plots to illustrate pairwise relationships amongst variables.
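
One way to loop over every pair of variables is `combn()`, which enumerates the unordered pairs of column names. The sketch below uses toy numeric columns only and plain scatterplots. It is an assumption about how such a loop could be written, not the plotting code used for this report.

```r
# Toy numeric columns (values are illustrative)
toy <- data.frame(
  Quota     = c(5, 12, 35, 35, 28),
  Bidders   = c(3, 42, 8, 3, 7),
  StartTime = c(1600, 1600, 800, 800, 1200)
)

# All unordered pairs of column names, as a list of length-2 character vectors
pairs_list <- combn(names(toy), 2, simplify = FALSE)

# One scatterplot per pair
for (p in pairs_list) {
  plot(toy[[p[1]]], toy[[p[2]]],
       xlab = p[1], ylab = p[2],
       main = paste(p[2], "against", p[1]))
}
```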

Phase 5: Answering Questions

Is it true that it is easier to bid for a module whose lecture is in the morning?

Do bids become higher as the rounds get later?

Do fewer people bid for a module if the lecture begins in the morning (before 12pm)?

Let's look at each module and compare the average number of bidders, bidders per quota and lowest successful bids between lectures that begin in the morning and lectures that begin later.
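
The per-module comparison can be sketched with a grouped summary. The numbers below are made up for illustration, and the output column names (`MeanBidders` and so on) are hypothetical. The point is only the shape of the computation: group by module and period, then average each statistic.

```r
library(dplyr)

# Made-up bidding rows for two modules, each offered in both periods
toy <- data.frame(
  ModuleCode          = c("PL2131", "PL2131", "PL2131", "PL2132", "PL2132"),
  Period              = c("Morning", "Morning", ">=Afternoon", "Morning", ">=Afternoon"),
  Bidders             = c(4, 6, 10, 3, 9),
  Quota               = c(10, 10, 10, 30, 30),
  LowestSuccessfulBid = c(1, 1, 50, 1, 20),
  stringsAsFactors = FALSE
)

by_period <- toy %>%
  mutate(BidPerQuota = Bidders / Quota) %>%
  group_by(ModuleCode, Period) %>%
  summarise(
    MeanBidders             = mean(Bidders),
    MeanBidPerQuota         = mean(BidPerQuota),
    MeanLowestSuccessfulBid = mean(LowestSuccessfulBid),
    .groups = "drop"   # return an ungrouped table
  )

by_period
```

Comparing the two rows for each module then shows whether the morning slot attracts fewer bidders for that same module, which avoids mixing popular and unpopular modules in one average.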

Bonus: Multilevel Modeling

Removal Of Modules With…

Peek Data

Do results from previous rounds…

are there dumping